Integrating Data and Probabilistically Structured Text Documents

Authors

Karsten Winkler

Myra Spiliopoulou

Abstract

Commercial, non-profit and public organizations are accumulating huge amounts of electronically available text documents. Although composed of unstructured text, the documents in archives such as annual reports to shareholders, medical patient records and public announcements often share an inherent, though undocumented, structure. To enable the integration of text collections with related structured data sources, this inherent structure should be made explicit in as much detail as possible. The goal of this study is to establish a methodology for integrating text documents with structured records into a hyper-archive of application-specific entities. The implicit structure of the text documents is made explicit by data mining techniques, as proposed in the DIAsDEM framework for semantic tagging of domain-specific text documents. The result is a probabilistic DTD that serves as a basis for matching both schemata and data instances.

Similar resources

Text Analytics to Data Warehousing

Information hidden or stored in unstructured data can play a critical role in making decisions, understanding and conducting other business functions. Integrating data stored in both structured and unstructured formats can add significant value to an organization. With the extent of development happening in Text Mining and technologies to deal with unstructured and semi-structured data like X...

Integrating a Structured-Text Retrieval System with an Object-Oriented Database System

We describe the integration of a structured-text retrieval system (TextMachine) into an object-oriented database system (OpenODB). Our approach is a light-weight one, using the external function capability of the database system to encapsulate the text retrieval system as an external information source. Yet, we are able to provide a tight integration in the query language and processing; the us...

Exploiting Evidence from Unstructured Data to Enhance Master Data Management

Master data management (MDM) integrates data from multiple structured data sources and builds a consolidated 360-degree view of business entities such as customers and products. Today’s MDM systems are not prepared to integrate information from unstructured data sources, such as news reports, emails, call-center transcripts, and chat logs. However, those unstructured data sources may contain val...

Learning to Classify Text from Labeled and Unlabeled Documents

In many important text classification problems, acquiring class labels for training documents is costly, while gathering large quantities of unlabeled data is cheap. This paper shows that the accuracy of text classifiers trained with a small number of labeled documents can be improved by augmenting this small training set with a large pool of unlabeled documents. We present a theoretical argume...

Using EM to Classify Text from Labeled and Unlabeled Documents

This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is significant because in many important text classification problems obtaining classification labels is expensive, while large quantities of unlabeled documents are readily available. We present a theoretical ar...

Publication date: 2001
